Proposal#1
|
Wow, I just spent the past hour going through your proposal and it looks really great. I like that you have made the whole memory implementation easier to understand. The code reads very rusty. I also liked the fact that most things are documented and well explained. I think it does make sense to think about implementing this as the base for arrow, especially with the performance improvement that you are reporting. Regarding the NativeType trait and its implementations, I couldn't understand why it has to be unsafe. Do you mind explaining that to me? |
The main difference happens in interactions with lower-end functionality, yes. But AFAIK folks at UrbanLogic (@maxburke) use them. For higher-end functionality, the main change is that creating a primitive from an iterator takes a few more characters: let array = iter.map(...).collect::<Primitive<i32>>().to(DataType::Date32); vs. let array = iter.map(...).collect::<Date32Array>(); This derives from the split between the logical and physical parts of the array, so that we can e.g. create timezone-aware timestamps: let array = iter.map(...).collect::<Primitive<i64>>().to(DataType::Timestamp(a, b)); which is not possible in current arrow without performing a cast of the array or using
As you can see, this really has a major impact on the crate as a whole, which is why it is so hard to implement. |
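The logical/physical split described above can be sketched in a few lines. This is a minimal illustration only: `Primitive`, `PrimitiveArray`, `DataType` and `TimeUnit` here are simplified stand-ins written for this example, not the actual definitions in the repo.

```rust
// Hypothetical, simplified stand-ins for illustration; not the repo's real types.

/// Unit of a timestamp (assumed subset).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TimeUnit {
    Second,
    Millisecond,
}

/// Logical types (assumed subset). Several logical types share one physical type.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Date32,
    Int64,
    /// Unit plus an optional timezone, which cannot be encoded in a Rust type alone.
    Timestamp(TimeUnit, Option<String>),
}

/// Physical part only: optional values of `T`, with no logical type attached yet.
#[derive(Debug, PartialEq)]
pub struct Primitive<T>(Vec<Option<T>>);

impl<T> FromIterator<Option<T>> for Primitive<T> {
    fn from_iter<I: IntoIterator<Item = Option<T>>>(iter: I) -> Self {
        Primitive(iter.into_iter().collect())
    }
}

/// Physical values together with the logical type chosen at runtime.
#[derive(Debug, PartialEq)]
pub struct PrimitiveArray<T> {
    data_type: DataType,
    values: Vec<Option<T>>,
}

impl<T> Primitive<T> {
    /// Attaches a logical type to the physical values without copying them.
    pub fn to(self, data_type: DataType) -> PrimitiveArray<T> {
        PrimitiveArray { data_type, values: self.0 }
    }
}

impl<T> PrimitiveArray<T> {
    pub fn data_type(&self) -> &DataType {
        &self.data_type
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }
}
```

Because the logical type is a runtime value rather than a type parameter, a timezone-aware timestamp needs no dedicated array type: the same `Primitive<i64>` can become any `i64`-backed logical type through `.to(...)`.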
The arrow specification assumes that a buffer can only be of certain types, and requires specific memory alignments for these. We use that to safely transfer data over FFI boundaries, write to parquet, IPC, etc. Furthermore, our allocator has an optimization on which we assume that Thus, struct A(HashMap<i32, i32>);
impl NativeType for A {} would leak and would also result in undefined behavior. Thus, we mark the trait as unsafe. |
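A rough sketch of the reasoning: marking a trait `unsafe` lets generic code rely on invariants the compiler cannot check. The trait below is a simplified assumption made for this example (the real `NativeType` in the repo has more requirements), and `as_bytes` is a hypothetical helper.

```rust
/// Simplified stand-in for the trait discussed above (assumed shape).
///
/// # Safety
/// Implementors must be plain data: no pointers, no padding bytes, no `Drop`.
/// Generic code below reinterprets values of `Self` as raw bytes.
pub unsafe trait NativeType: Copy + 'static {}

// Each impl is a promise by the implementor that the contract above holds.
unsafe impl NativeType for i32 {}
unsafe impl NativeType for f64 {}

/// Hypothetical helper: views a slice of native values as bytes, e.g. to hand
/// it to an IO writer or across an FFI boundary without copying.
pub fn as_bytes<T: NativeType>(values: &[T]) -> &[u8] {
    // SAFETY: `NativeType` implementors guarantee that every byte of `T` is
    // initialized (no padding), so reading them as `u8` is sound.
    unsafe {
        std::slice::from_raw_parts(values.as_ptr() as *const u8, std::mem::size_of_val(values))
    }
}
```

Note that a `Copy` bound alone would not be enough: a type such as `struct Padded(u8, u32)` can derive `Copy` yet contains padding bytes, so reading it through `as_bytes` would be undefined behavior. The `unsafe` keyword moves that responsibility to whoever writes the `impl`.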
Gotcha. So it is a warning to someone wanting to implement the trait for another type. Yesterday I was playing with your code and saw that I could remove the unsafe option from the trait and it would still compile. Out of curiosity, would it not make sense to keep the trait hidden (not pub) instead of using unsafe? |
Yes, that is the main use-case of
I agree with this sentiment. However, that would not allow people to write generics that depend on it. It is also not possible in Rust to publicly expose generic functions whose type parameters are not public, as it would "leak" a private trait via a public function. There are some discussions to allow a trait to be "sealed", but it is still in RFC phase, which I think would solve this problem. |
|
I guess the sealed traits could be implemented like this example |
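For readers without the linked example at hand, the commonly used sealed-trait pattern looks roughly like this. It is a sketch of the general technique, not the code from the link; `byte_width` and `total_bytes` are hypothetical names used only for illustration.

```rust
mod private {
    /// Public trait inside a private module: downstream crates cannot name
    /// it, so they cannot implement it.
    pub trait Sealed {}

    impl Sealed for i32 {}
    impl Sealed for i64 {}
}

/// Public trait that users can name and use as a generic bound, but cannot
/// implement, because they cannot satisfy the `Sealed` supertrait.
pub trait NativeType: private::Sealed + Copy {
    /// Size in bytes of one value of this type.
    fn byte_width() -> usize {
        std::mem::size_of::<Self>()
    }
}

impl NativeType for i32 {}
impl NativeType for i64 {}

/// Downstream code can still write generics over the sealed trait.
pub fn total_bytes<T: NativeType>(values: &[T]) -> usize {
    values.len() * T::byte_width()
}
```

This keeps the trait usable in public generic functions, which a fully private trait would not allow, while restricting the set of implementors to the ones listed in this crate.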
|
Very interesting read, and it really scratches an itch that I experienced when I started using Arrow but got used to over time through exposure. What is the main idea: fork and continue as arrow2? Or make an MVP and hope to get that merged into the main project? |
|
I hope we don't end up with a fork. While it will be painful, I think breaking this PR up into pieces and bringing it incrementally into the main arrow codebase would be the most plausible way to bring the idea to fruition. |
|
I also hope so, and I am working towards a plan to have this merged on the main repo. Technically, the core hypothesis that I am testing atm is that the way this repo handles offsets works with at least the IPC integration tests. I am really uncertain here. This PR uses a different approach to offsets: they are no longer tracked only on ArrayData; instead, each Buffer/Bitmap tracks its own, the same way the bytes crate from Tokio does it. This is dramatically easier to work with, but I am unsure whether it will work with IPC and FFI. If it does not work, I may have to revisit this whole thing. I am working towards having the json IO migrated to this repo, as the integration tests for IPC use json. |
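To make the offset idea concrete, here is a minimal sketch of a buffer that carries its own offset and length over shared storage. The struct layout and method names are assumptions made for this example and are not taken from the repo.

```rust
use std::sync::Arc;

/// Hypothetical buffer: shared, immutable storage plus its own window into it.
#[derive(Debug, Clone)]
pub struct Buffer<T> {
    data: Arc<Vec<T>>,
    offset: usize,
    length: usize,
}

impl<T> Buffer<T> {
    /// Takes ownership of the values; the initial window covers all of them.
    pub fn from_vec(values: Vec<T>) -> Self {
        let length = values.len();
        Self { data: Arc::new(values), offset: 0, length }
    }

    /// O(1) slice: bumps the reference count and narrows the window.
    /// `offset` is relative to this buffer's current window.
    pub fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length, "slice out of bounds");
        Self {
            data: Arc::clone(&self.data),
            offset: self.offset + offset,
            length,
        }
    }

    /// The values visible through this buffer's window.
    pub fn as_slice(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.length]
    }
}
```

With this layout, an array made of several buffers does not need a single array-level offset that every consumer must remember to apply; each buffer already exposes only the values it is supposed to show.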
|
Another thing that could help to get traction behind your idea is to have more performance comparisons between the current Arrow and the new implementation. I know you have put a lot of time into this, so if you would like help testing something, let me know and I can help with that. A list of the things that you want to test could be useful so we can work on that. Sorry to piggyback on this thread to ask something related to your NativeType trait implementation, but would using a private module (example) have the same effect you are looking for in sealing the traits? |
|
Sorry for the late reply, but the last few weeks have been pretty busy. However, I did have time to work on this. I have now concluded the feasibility study that I wanted to do on this.
I thus conclude that the biggest risk for this endeavor is regressions on the parquet IO. Recommended actions:
|
I think this is an excellent idea. |
|
I am closing this PR, as I will start the work of making this repo usable and stuff. The ideas above hold, but I plan to use this repo to have the version of the code with a transmute-free implementation of arrow. |
|
The repo now contains the design and implementation that I have been baking over the past months. Some modules have a README with the actual design notes of them (i.e. MUST, MAY, SHOULD, etc). The repo has most things implemented with the notable exception of parquet IO, which I am still trying to grasp. I feature-gated almost everything so that the crate depends on 3 small dependencies and chrono. Many, many components were re-written from scratch because they left me no other choice. I also deprecated the "Builder" API, as it is entirely replaced by a The specification is always validated when an array is created and there is little room for unsoundness. |
|
😮 - so now the next question is, "what next, and what can we do to help @jorgecarleitao?" |
See README.md